|
@andreidan I think this goes a little too deep down the rabbit hole with the implementation. We should start with a minimum workable version and go from there. My idea of how this would work is:
And then, when noticing that the step is in the ERROR step we move the index back to the non-failed step with the existing This leaves us with something that works for some of the steps, but doesn't solve the For now though, I think we can add the retry only to the I also don't think we need the exponential back-off (at least not in the first iteration) since we have the poll interval and cluster state update running that space out the retries a bit. I also think we need to expose that ILM is on a retried step somewhere in the explain API, as well as put a message in the logs about moving out of the ERROR step because the step is retryable. @andreidan @gwbrown - thoughts? |
|
I agree with @dakrone that I don't think we need the exponential backoff at this point - waiting for the poll interval and/or waiting on cluster state updates is probably good enough. Given that, I think we can keep any retry metadata in the (Aside: Now that I think about it, we could implement exponential backoff that way too without having to keep local state: If we keep the retry count in the Regarding a concern you brought up elsewhere about having to parse the exception back out of the serialized version: one way to avoid doing that would be to make the decision at the time we move into the error step in the first place - one way to do that would be to add a field |
|
Thanks for your input @dakrone @gwbrown. I think moving the cluster state to the failed step only, without explicitly executing the step, and using the |
|
Superseded by #48256 |
Sketch for retrying failed ILM steps.
This is very much a draft, meant to illustrate what identifying retryable steps and attempting the retry would look like if we handled them in the IndexLifecycleRunner.
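
For readers following the discussion, here is a minimal, self-contained sketch of the simpler flow proposed in the conversation: decide whether a step is retryable at the moment the index moves into the ERROR step, then on the next poll move the index back to the failed step and bump a retry count. It is not the code in this PR. `RetrySketch`, `State`, `moveToError`, and `onPoll` are hypothetical names invented for illustration, and the state class is a stand-in assumption rather than the real ILM execution state.

```java
// Hypothetical sketch: none of these names come from the Elasticsearch code base.
final class RetrySketch {
    static final String ERROR_STEP = "ERROR";

    /** Simplified, immutable per-index lifecycle state (an assumption for illustration). */
    static final class State {
        final String currentStep;  // step the index is currently on
        final String failedStep;   // step that failed, or null if none
        final boolean retryable;   // decided once, when entering ERROR
        final int retryCount;      // number of retries attempted so far

        State(String currentStep, String failedStep, boolean retryable, int retryCount) {
            this.currentStep = currentStep;
            this.failedStep = failedStep;
            this.retryable = retryable;
            this.retryCount = retryCount;
        }
    }

    /**
     * Record the failure. The retryable flag is stored here so that nothing
     * has to parse the serialized exception later to make the decision.
     */
    static State moveToError(State s, boolean stepIsRetryable) {
        return new State(ERROR_STEP, s.currentStep, stepIsRetryable, s.retryCount);
    }

    /**
     * Called on each poll or cluster state update. If the index sits in ERROR
     * and the failed step was marked retryable, move it back to that step and
     * increment the retry count; otherwise return the state unchanged.
     */
    static State onPoll(State s) {
        if (!ERROR_STEP.equals(s.currentStep) || !s.retryable) {
            return s;
        }
        return new State(s.failedStep, null, false, s.retryCount + 1);
    }
}
```

Because the retry count travels with the state, a back-off could later be derived from it without keeping local state. This sketch leaves that out and relies on the spacing between polls, as suggested in the thread.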